ABSTRACT

Several anonymization techniques, such as generalization 
and bucketization, have been designed for privacy preserving 
microdata publishing. Recent work has shown that
generalization loses a considerable amount of information,
especially for high-dimensional data. Bucketization, on the
other hand, does not prevent membership disclosure and does
not apply to data that do not have a clear separation between
quasi-identifying attributes and sensitive attributes.

In this paper, we present a novel technique called slicing, 
which partitions the data both horizontally and vertically. 
We show that slicing preserves better data utility than gen- 
eralization and can be used for membership disclosure pro- 
tection. Another important advantage of slicing is that it 
can handle high-dimensional data. We show how slicing can 
be used for attribute disclosure protection and develop an ef- 
ficient algorithm for computing the sliced data that obey the 
ℓ-diversity requirement. Our workload experiments confirm
that slicing preserves better utility than generalization and 
is more effective than bucketization in workloads involving 
the sensitive attribute. Our experiments also demonstrate 
that slicing can be used to prevent membership disclosure. 

1. INTRODUCTION 

Privacy-preserving publishing of microdata has been stud- 
ied extensively in recent years. Microdata contains records 
each of which contains information about an individual en- 
tity, such as a person, a household, or an organization. 
Several microdata anonymization techniques have been
proposed. The most popular ones are generalization [29] [31]
for k-anonymity [31] and bucketization [35] [25] [16] for
ℓ-diversity [23]. In both approaches, attributes are partitioned
into three categories: (1) some attributes are identifiers that 
can uniquely identify an individual, such as Name or Social 
Security Number, (2) some attributes are Quasi-Identifiers 
(QI), which the adversary may already know (possibly from
other publicly-available databases) and which, when taken 
together, can potentially identify an individual, e.g.,
Birthdate, Sex, and Zipcode; (3) some attributes are Sensitive
Attributes (SAs), which are unknown to the adversary and 
are considered sensitive, such as Disease and Salary. 

In both generalization and bucketization, one first removes 
identifiers from the data and then partitions tuples into 
buckets. The two techniques differ in the next step.
Generalization transforms the QI values in each bucket into "less
specific but semantically consistent" values so that tuples in 
the same bucket cannot be distinguished by their QI val- 
ues. In bucketization, one separates the SAs from the QIs 
by randomly permuting the SA values in each bucket. The 
anonymized data consists of a set of buckets with permuted 
sensitive attribute values. 

1.1 Motivation of Slicing 

It has been shown [1] [15] [35] that generalization for
k-anonymity loses a considerable amount of information,
especially for high-dimensional data. This is due to the following
three reasons. First, generalization for k-anonymity suffers
from the curse of dimensionality. In order for generalization 
to be effective, records in the same bucket must be close to 
each other so that generalizing the records would not lose too 
much information. However, in high-dimensional data, most 
data points have similar distances to each other, forcing a
great amount of generalization to satisfy k-anonymity even
for relatively small k. Second, in order to perform data
analysis or data mining tasks on the generalized table, the 
data analyst has to make the uniform distribution assump- 
tion that every value in a generalized interval/set is equally 
possible, as no other distribution assumption can be justi- 
fied. This significantly reduces the data utility of the gen- 
eralized data. Third, because each attribute is generalized 
separately, correlations between different attributes are lost. 
In order to study attribute correlations on the generalized 
table, the data analyst has to assume that every possible 
combination of attribute values is equally possible. This is 
an inherent problem of generalization that prevents effective 
analysis of attribute correlations. 

While bucketization [35] [25] [16] has better data utility
than generalization, it has several limitations. First, buck- 
etization does not prevent membership disclosure [27]. Be- 
cause bucketization publishes the QI values in their original 
forms, an adversary can find out whether an individual has 
a record in the published data or not. As shown in [31],
87% of the individuals in the United States can be uniquely
identified using only three attributes (Birthdate, Sex, and
Zipcode). A microdata table (e.g., census data) usually contains
many other attributes besides those three attributes. This
means that the membership information of most individuals 



can be inferred from the bucketized table. Second, buck- 
etization requires a clear separation between QIs and SAs. 
However, in many datasets, it is unclear which attributes are 
QIs and which are SAs. Third, by separating the sensitive 
attribute from the QI attributes, bucketization breaks the 
attribute correlations between the QIs and the SAs. 

In this paper, we introduce a novel data anonymization 
technique called slicing to improve the current state of the 
art. Slicing partitions the dataset both vertically and hori- 
zontally. Vertical partitioning is done by grouping attributes 
into columns based on the correlations among the attributes. 
Each column contains a subset of attributes that are highly 
correlated. Horizontal partitioning is done by grouping tu- 
ples into buckets. Finally, within each bucket, values in each 
column are randomly permuted (or sorted) to break the
linking between different columns.

The basic idea of slicing is to break the association across
columns, but to preserve the association within each col- 
umn. This reduces the dimensionality of the data and pre- 
serves better utility than generalization and bucketization. 
Slicing preserves utility because it groups highly-correlated 
attributes together, and preserves the correlations between 
such attributes. Slicing protects privacy because it breaks 
the associations between uncorrelated attributes, which are 
infrequent and thus identifying. Note that when the dataset 
contains QIs and one SA, bucketization has to break their 
correlation; slicing, on the other hand, can group some QI at- 
tributes with the SA, preserving attribute correlations with 
the sensitive attribute. 

The key intuition behind the privacy protection that slicing
provides is that the slicing process ensures that, for any
tuple, there are generally multiple matching buckets. Given a
tuple $t = (v_1, v_2, \ldots, v_c)$, where c is the number of
columns, a bucket is a matching bucket for t if and only if
for each i ($1 \le i \le c$), $v_i$ appears at least once in
the i-th column of the bucket. Any bucket that contains the
original tuple is a matching bucket. At the same time, a bucket
can be a matching bucket because it contains other tuples, each
of which contains some but not all of the $v_i$'s.

1.2 Contributions & Organization 

In this paper, we present a novel technique called slicing 
for privacy-preserving data publishing. Our contributions 
include the following. 

First, we introduce slicing as a new technique for privacy 
preserving data publishing. Slicing has several advantages 
when compared with generalization and bucketization. It 
preserves better data utility than generalization. It pre- 
serves more attribute correlations with the SAs than bucke- 
tization. It can also handle high-dimensional data and data 
without a clear separation of QIs and SAs. 

Second, we show that slicing can be effectively used for 
preventing attribute disclosure, based on the privacy
requirement of ℓ-diversity. We introduce a notion called
ℓ-diverse slicing, which ensures that the adversary cannot
learn the sensitive value of any individual with a probability
greater than 1/ℓ.

Third, we develop an efficient algorithm for computing 
the sliced table that satisfies ℓ-diversity. Our algorithm
partitions attributes into columns, applies column
generalization, and partitions tuples into buckets. Attributes that are
highly-correlated are in the same column; this preserves the 
correlations between such attributes. The associations
between uncorrelated attributes are broken; this provides
better privacy, as the associations between such attributes are
less frequent and potentially identifying.

Fourth, we describe the intuition behind membership
disclosure and explain how slicing prevents membership
disclosure. A bucket of size k can potentially match $k^c$ tuples,
where c is the number of columns. Because only k of the
$k^c$ tuples are actually in the original data, the existence of
the other $k^c - k$ tuples hides the membership information of
tuples in the original data.

Finally, we conduct extensive workload experiments. Our 
results confirm that slicing preserves much better data util- 
ity than generalization. In workloads involving the sensitive 
attribute, slicing is also more effective than bucketization. 
In some classification experiments, slicing shows better per- 
formance than using the original data (which may overfit 
the model). Our experiments also show the limitations of
bucketization in membership disclosure protection and show
that slicing remedies these limitations.

The rest of this paper is organized as follows. In Section 2,
we formalize the slicing technique and compare it with
generalization and bucketization. We define ℓ-diverse slicing for
attribute disclosure protection in Section 3 and develop an
efficient algorithm to achieve ℓ-diverse slicing in Section 4.
In Section 5, we explain how slicing prevents membership
disclosure. Experimental results are presented in Section 6
and related work is discussed in Section 7. We conclude the
paper and discuss future research in Section 8.

2. SLICING 

In this section, we first give an example to illustrate slic- 
ing. We then formalize slicing, compare it with general- 
ization and bucketization, and discuss privacy threats that 
slicing can address. 

Table 1 shows an example microdata table and its
anonymized versions using various anonymization
techniques. The original table is shown in Table 1(a). The
three QI attributes are {Age, Sex, Zipcode}, and the
sensitive attribute SA is Disease. A generalized table that
satisfies 4-anonymity is shown in Table 1(b), a bucketized table
that satisfies 2-diversity is shown in Table 1(c), a
generalized table where each attribute value is replaced with
the multiset of values in the bucket is shown in Table 1(d),
and two sliced tables are shown in Table 1(e) and Table 1(f).

Slicing first partitions attributes into columns. Each
column contains a subset of attributes. This vertically
partitions the table. For example, the sliced table in Table 1(f)
contains 2 columns: the first column contains {Age, Sex}
and the second column contains {Zipcode, Disease}. The
sliced table shown in Table 1(e) contains 4 columns, where
each column contains exactly one attribute.

Slicing also partitions tuples into buckets. Each bucket
contains a subset of tuples. This horizontally partitions the
table. For example, both sliced tables in Table 1(e) and
Table 1(f) contain 2 buckets, each containing 4 tuples.

Within each bucket, values in each column are randomly
permuted to break the linking between different columns.
For example, in the first bucket of the sliced table shown in
Table 1(f), the values {(22, M), (22, F), (33, F), (52, F)} are
randomly permuted and the values {(47906, dyspepsia),
(47906, flu), (47905, flu), (47905, bronchitis)} are randomly
permuted so that the linking between the two columns
within one bucket is hidden.
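
The two partitioning steps and the permutation step can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm developed later in the paper: the function name `slice_table`, its parameters, and the fixed random seed are assumptions introduced here, and the attribute and tuple partitions are supplied by hand rather than computed.

```python
import random

ATTRS = ("Age", "Sex", "Zipcode", "Disease")
TABLE = [  # the original table, Table 1(a)
    (22, "M", 47906, "dyspepsia"),
    (22, "F", 47906, "flu"),
    (33, "F", 47905, "flu"),
    (52, "F", 47905, "bronchitis"),
    (54, "M", 47302, "flu"),
    (60, "M", 47302, "dyspepsia"),
    (60, "M", 47304, "dyspepsia"),
    (64, "F", 47304, "gastritis"),
]

def slice_table(rows, attrs, columns, buckets, seed=0):
    """Return, for each bucket, one permuted list of values per column.

    columns: list of attribute-name tuples (a hand-picked attribute partition).
    buckets: list of row-index lists (a hand-picked tuple partition).
    """
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    position = {name: i for i, name in enumerate(attrs)}
    sliced = []
    for bucket in buckets:
        bucket_columns = []
        for column in columns:
            values = [tuple(rows[r][position[a]] for a in column) for r in bucket]
            rng.shuffle(values)  # hide the link to the other columns
            bucket_columns.append(values)
        sliced.append(bucket_columns)
    return sliced

sliced = slice_table(
    TABLE, ATTRS,
    columns=[("Age", "Sex"), ("Zipcode", "Disease")],
    buckets=[[0, 1, 2, 3], [4, 5, 6, 7]],
)
for bucket in sliced:
    print(bucket)
```

The order of values inside each column depends on the seed, but the multiset of values in each column of each bucket is always the same as in the original data.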



Age   Sex   Zipcode   Disease
-----------------------------
22    M     47906     dyspepsia
22    F     47906     flu
33    F     47905     flu
52    F     47905     bronchitis
54    M     47302     flu
60    M     47302     dyspepsia
60    M     47304     dyspepsia
64    F     47304     gastritis

(a) The original table

Age       Sex   Zipcode   Disease
---------------------------------
[20-52]   *     4790*     dyspepsia
[20-52]   *     4790*     flu
[20-52]   *     4790*     flu
[20-52]   *     4790*     bronchitis
---------------------------------
[54-64]   *     4730*     flu
[54-64]   *     4730*     dyspepsia
[54-64]   *     4730*     dyspepsia
[54-64]   *     4730*     gastritis

(b) The generalized table

Age   Sex   Zipcode   Disease
-----------------------------
22    M     47906     flu
22    F     47906     dyspepsia
33    F     47905     bronchitis
52    F     47905     flu
-----------------------------
54    M     47302     gastritis
60    M     47302     flu
60    M     47304     dyspepsia
64    F     47304     dyspepsia

(c) The bucketized table

Age              Sex       Zipcode           Disease
----------------------------------------------------
22:2,33:1,52:1   M:1,F:3   47905:2,47906:2   dysp.
22:2,33:1,52:1   M:1,F:3   47905:2,47906:2   flu
22:2,33:1,52:1   M:1,F:3   47905:2,47906:2   flu
22:2,33:1,52:1   M:1,F:3   47905:2,47906:2   bron.
----------------------------------------------------
54:1,60:2,64:1   M:3,F:1   47302:2,47304:2   flu
54:1,60:2,64:1   M:3,F:1   47302:2,47304:2   dysp.
54:1,60:2,64:1   M:3,F:1   47302:2,47304:2   dysp.
54:1,60:2,64:1   M:3,F:1   47302:2,47304:2   gast.

(d) Multiset-based generalization

Age   Sex   Zipcode   Disease
-----------------------------
22    F     47906     flu
22    M     47905     flu
33    F     47906     dysp.
52    F     47905     bron.
-----------------------------
54    M     47302     dysp.
60    F     47304     gast.
60    M     47302     dysp.
64    M     47304     flu

(e) One-attribute-per-column slicing

(Age,Sex)   (Zipcode,Disease)
-----------------------------
(22,M)      (47905,flu)
(22,F)      (47906,dysp.)
(33,F)      (47905,bron.)
(52,F)      (47906,flu)
-----------------------------
(54,M)      (47304,gast.)
(60,M)      (47302,flu)
(60,M)      (47302,dysp.)
(64,F)      (47304,dysp.)

(f) The sliced table

Table 1: An original microdata table and its anonymized versions using various anonymization techniques



2.1 Formalization of Slicing 

Let T be the microdata table to be published. T contains
d attributes: $A = \{A_1, A_2, \ldots, A_d\}$ and their attribute
domains are $\{D[A_1], D[A_2], \ldots, D[A_d]\}$. A tuple $t \in T$ can
be represented as $t = (t[A_1], t[A_2], \ldots, t[A_d])$ where $t[A_i]$
($1 \le i \le d$) is the $A_i$ value of t.

Definition 1 (Attribute partition and columns).
An attribute partition consists of several subsets of A,
such that each attribute belongs to exactly one subset. Each
subset of attributes is called a column. Specifically, let
there be c columns $C_1, C_2, \ldots, C_c$, then $\bigcup_{i=1}^{c} C_i = A$ and for
any $1 \le i_1 \ne i_2 \le c$, $C_{i_1} \cap C_{i_2} = \emptyset$.

For simplicity of discussion, we consider only one sensi- 
tive attribute S. If the data contains multiple sensitive at- 
tributes, one can either consider them separately or consider 
their joint distribution [23]. Exactly one of the c columns 
contains S. Without loss of generality, let the column that 
contains S be the last column $C_c$. This column is also called
the sensitive column. All other columns $\{C_1, C_2, \ldots, C_{c-1}\}$
contain only QI attributes.

Definition 2 (Tuple partition and buckets). A
tuple partition consists of several subsets of T, such
that each tuple belongs to exactly one subset. Each subset
of tuples is called a bucket. Specifically, let there be b
buckets $B_1, B_2, \ldots, B_b$, then $\bigcup_{i=1}^{b} B_i = T$ and for any
$1 \le i_1 \ne i_2 \le b$, $B_{i_1} \cap B_{i_2} = \emptyset$.

Definition 3 (Slicing). Given a microdata table T, a
slicing of T is given by an attribute partition and a
tuple partition.

For example, Table 1(e) and Table 1(f) are two sliced
tables. In Table 1(e), the attribute partition is {{Age},
{Sex}, {Zipcode}, {Disease}} and the tuple partition is
$\{\{t_1, t_2, t_3, t_4\}, \{t_5, t_6, t_7, t_8\}\}$. In Table 1(f), the attribute
partition is {{Age, Sex}, {Zipcode, Disease}} and the tuple
partition is $\{\{t_1, t_2, t_3, t_4\}, \{t_5, t_6, t_7, t_8\}\}$.
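
The partition conditions in Definitions 1 and 2 (every element in exactly one subset) are easy to check mechanically. The following Python sketch is illustrative only; the helper name `is_partition` is an assumption made here and does not appear in the paper.

```python
def is_partition(subsets, universe):
    """True iff the subsets are pairwise disjoint and their union is the universe."""
    seen = set()
    for subset in subsets:
        subset = set(subset)
        if subset & seen:  # some element already belongs to an earlier subset
            return False
        seen |= subset
    return seen == set(universe)

ATTRIBUTES = {"Age", "Sex", "Zipcode", "Disease"}
TUPLE_IDS = set(range(1, 9))  # t_1 .. t_8

# Attribute and tuple partitions of the sliced table in Table 1(f).
columns = [{"Age", "Sex"}, {"Zipcode", "Disease"}]
buckets = [{1, 2, 3, 4}, {5, 6, 7, 8}]

print(is_partition(columns, ATTRIBUTES))  # valid attribute partition
print(is_partition(buckets, TUPLE_IDS))   # valid tuple partition
print(is_partition([{"Age"}, {"Age", "Sex"}, {"Zipcode", "Disease"}], ATTRIBUTES))
```

The first two calls return True; the third returns False because Age would belong to two columns.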

Often, slicing also involves column generalization.



Definition 4 (Column Generalization). Given a
microdata table T and a column $C_i = \{A_{i1}, A_{i2}, \ldots, A_{ij}\}$, a
column generalization for $C_i$ is defined as a set of
non-overlapping j-dimensional regions that completely cover
$D[A_{i1}] \times D[A_{i2}] \times \cdots \times D[A_{ij}]$. A column generalization
maps each value of $C_i$ to the region in which the value is
contained.

Column generalization ensures that one column satisfies
the k-anonymity requirement. It is a multidimensional
encoding [17] and can be used as an additional step in
slicing. Specifically, a general slicing algorithm consists of the
following three phases: attribute partition, column
generalization, and tuple partition. Because each column contains
far fewer attributes than the whole table, attribute
partition enables slicing to handle high-dimensional data.
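
To make Definition 4 concrete, the sketch below maps each value of a column (Age, Sex) to the region that contains it and then checks a per-column k-anonymity condition. Everything specific here is an assumption for illustration: the two regions, the age range 0 to 120, the helper names, and the reading of "the column satisfies k-anonymity" as "every generalized value occurs at least k times in the column".

```python
from collections import Counter

# Hypothetical regions for the column (Age, Sex): an age interval paired
# with a set of Sex values. They do not overlap and cover ages 0..120.
REGIONS = [
    ((0, 39), frozenset({"M", "F"})),
    ((40, 120), frozenset({"M", "F"})),
]

def generalize(value, regions):
    """Map an (age, sex) value to the single region that contains it."""
    age, sex = value
    for (low, high), sexes in regions:
        if low <= age <= high and sex in sexes:
            return ((low, high), sexes)
    raise ValueError(f"no region covers {value!r}")

def is_k_anonymous(values, regions, k):
    """True iff every region that occurs is shared by at least k values."""
    counts = Counter(generalize(v, regions) for v in values)
    return all(n >= k for n in counts.values())

# (Age, Sex) values of the eight tuples in Table 1(a).
AGE_SEX = [(22, "M"), (22, "F"), (33, "F"), (52, "F"),
           (54, "M"), (60, "M"), (60, "M"), (64, "F")]

print(generalize((22, "M"), REGIONS))
print(is_k_anonymous(AGE_SEX, REGIONS, 3))
print(is_k_anonymous(AGE_SEX, REGIONS, 4))
```

With these hypothetical regions, three values fall into the first region and five into the second, so the column is 3-anonymous but not 4-anonymous.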

A key notion of slicing is that of matching buckets. 

Definition 5 (Matching Buckets). Let
$\{C_1, C_2, \ldots, C_c\}$ be the c columns of a sliced table.
Let t be a tuple, and $t[C_i]$ be the $C_i$ value of t. Let B be a
bucket in the sliced table, and $B[C_i]$ be the multiset of $C_i$
values in B. We say that B is a matching bucket of t iff
for all $1 \le i \le c$, $t[C_i] \in B[C_i]$.

For example, consider the sliced table shown in Table 1(f),
and consider $t_1$ = (22, M, 47906, dyspepsia). Then, the set
of matching buckets for $t_1$ is $\{B_1\}$.
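
Definition 5 translates directly into a membership test per column. The Python sketch below encodes the sliced table of Table 1(f) as nested lists; this representation and the function name `matching_buckets` are assumptions made for illustration.

```python
# Table 1(f): each bucket holds one list of values per column,
# first (Age, Sex), then (Zipcode, Disease).
SLICED = [
    [  # bucket B1
        [(22, "M"), (22, "F"), (33, "F"), (52, "F")],
        [(47905, "flu"), (47906, "dyspepsia"), (47905, "bronchitis"), (47906, "flu")],
    ],
    [  # bucket B2
        [(54, "M"), (60, "M"), (60, "M"), (64, "F")],
        [(47304, "gastritis"), (47302, "flu"), (47302, "dyspepsia"), (47304, "dyspepsia")],
    ],
]

def matching_buckets(t_columns, sliced):
    """Indices of the buckets B such that t[C_i] is in B[C_i] for every column."""
    return [
        index
        for index, bucket in enumerate(sliced)
        if all(value in column for value, column in zip(t_columns, bucket))
    ]

t1 = [(22, "M"), (47906, "dyspepsia")]   # t_1 split by column
print(matching_buckets(t1, SLICED))       # index 0 stands for B1
```

A tuple whose column values are spread over different buckets, such as ((22, M), (47302, flu)), has no matching bucket at all.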

2.2 Comparison with Generalization 

There are several types of recodings for generalization. 
The recoding that preserves the most information is local 
recoding. In local recoding, one first groups tuples into buck- 
ets and then for each bucket, one replaces all values of one 
attribute with a generalized value. Such a recoding is local 
because the same attribute value may be generalized
differently when it appears in different buckets.

We now show that slicing preserves more information than 
such a local recoding approach, assuming that the same tu- 
ple partition is used. We achieve this by showing that slicing 



is better than the following enhancement of the local recod- 
ing approach. Rather than using a generalized value to re- 
place more specific attribute values, one uses the multiset of 
exact values in each bucket. For example, Table 1(b) is a
generalized table, and Table 1(d) is the result of using
multisets of exact values rather than generalized values. For the
Age attribute of the first bucket, we use the multiset of
exact values {22,22,33,52} rather than the generalized interval
[20-52]. The multiset of exact values provides more
information about the distribution of values in each attribute
than the generalized interval. Therefore, using multisets of 
exact values preserves more information than generalization. 
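
As a small illustration of the multiset-based enhancement, the Python sketch below replaces each QI attribute in a bucket by the multiset of its exact values, using `collections.Counter`. The function names and the `qi_count` parameter are assumptions introduced here; the sensitive attribute is left out, as in Table 1(d).

```python
from collections import Counter

TABLE = [  # the original table, Table 1(a)
    (22, "M", 47906, "dyspepsia"),
    (22, "F", 47906, "flu"),
    (33, "F", 47905, "flu"),
    (52, "F", 47905, "bronchitis"),
    (54, "M", 47302, "flu"),
    (60, "M", 47302, "dyspepsia"),
    (60, "M", 47304, "dyspepsia"),
    (64, "F", 47304, "gastritis"),
]

def multiset_generalize(rows, buckets, qi_count=3):
    """For each bucket, return one multiset of exact values per QI attribute."""
    generalized = []
    for bucket in buckets:
        # Transpose the bucket's QI values: one tuple of values per attribute.
        qi_columns = zip(*(rows[r][:qi_count] for r in bucket))
        generalized.append([Counter(column) for column in qi_columns])
    return generalized

def show(multiset):
    """Render a multiset in the value:count notation of Table 1(d)."""
    return ",".join(f"{value}:{count}" for value, count in sorted(multiset.items()))

result = multiset_generalize(TABLE, [[0, 1, 2, 3], [4, 5, 6, 7]])
print([show(m) for m in result[0]])
# ['22:2,33:1,52:1', 'F:3,M:1', '47905:2,47906:2']
```

Unlike the interval [20-52], the multiset for Age in the first bucket still records that 22 occurs twice and that no tuple has an age between 34 and 51.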

However, we observe that this multiset-based generaliza- 
tion is equivalent to a trivial slicing scheme where each 
column contains exactly one attribute, because both ap- 
proaches preserve the exact values in each attribute but 
break the association between them within one bucket. For 
example, Table 1(e) is equivalent to Table 1(d). Now
comparing Table 1(e) with the sliced table shown in Table 1(f),
we observe that while one-attribute-per-column slicing pre- 
serves attribute distributional information, it does not pre- 
serve attribute correlation, because each attribute is in its 
own column. In slicing, one groups correlated attributes 
together in one column and preserves their correlation. For 
example, in the sliced table shown in Table 1(f), correlations
between Age and Sex and correlations between Zipcode and 
Disease are preserved. In fact, the sliced table encodes the 
same amount of information as the original data with regard 
to correlations between attributes in the same column. 

Another important advantage of slicing is its ability to 
handle high-dimensional data. By partitioning attributes 
into columns, slicing reduces the dimensionality of the data. 
Each column of the table can be viewed as a sub-table with 
a lower dimensionality. Slicing is also different from the 
approach of publishing multiple independent sub-tables in 
that these sub-tables are linked by the buckets in slicing. 

2.3 Comparison with Bucketization 

To compare slicing with bucketization, we first note that 
bucketization can be viewed as a special case of slicing, 
where there are exactly two columns: one column contains 
only the SA, and the other contains all the QIs. The ad- 
vantages of slicing over bucketization can be understood as 
follows. First, by partitioning attributes into more than two 
columns, slicing can be used to prevent membership
disclosure. Our empirical evaluation on a real dataset (Section 6)
shows that bucketization does not prevent membership
disclosure.

Second, unlike bucketization, which requires a clear sep- 
aration of QI attributes and the sensitive attribute, slicing 
can be used without such a separation. For a dataset such as
the census data, one often cannot clearly separate QIs from 
SAs because there is no single external public database that 
one can use to determine which attributes the adversary al- 
ready knows. Slicing can be useful for such data. 

Finally, by allowing a column to contain both some QI 
attributes and the sensitive attribute, attribute correlations 
between the sensitive attribute and the QI attributes are 
preserved. For example, in Table 1(f), Zipcode and Disease
form one column, enabling inferences about their
correlations. Attribute correlations are an important source of
utility in data publishing. For workloads that consider
attributes in isolation, one can simply publish two tables, one
containing all QI attributes and one containing the sensitive
attribute.

2.4 Privacy Threats 

When publishing microdata, there are three types of pri- 
vacy disclosure threats. The first type is membership disclo- 
sure. When the dataset to be published is selected from a 
large population and the selection criteria are sensitive (e.g., 
only diabetes patients are selected), one needs to prevent
adversaries from learning whether one's record is included in
the published dataset. 

The second type is identity disclosure, which occurs when 
an individual is linked to a particular record in the released 
table. In some situations, one wants to protect against iden- 
tity disclosure when the adversary is uncertain of member- 
ship. In this case, protection against membership disclo- 
sure helps protect against identity disclosure. In other sit- 
uations, some adversary may already know that an indi- 
vidual's record is in the published dataset, in which case, 
membership disclosure protection either does not apply or 
is insufficient. 

The third type is attribute disclosure, which occurs when 
new information about some individuals is revealed, i.e., the 
released data makes it possible to infer the attributes of an 
individual more accurately than would be possible before
the release. Similar to the case of identity disclosure, we 
need to consider adversaries who already know the mem- 
bership information. Identity disclosure leads to attribute 
disclosure. Once there is identity disclosure, an individual 
is re-identified and the corresponding sensitive value is re- 
vealed. Attribute disclosure can occur with or without iden- 
tity disclosure, e.g., when the sensitive values of all matching 
tuples are the same. 

For slicing, we consider protection against membership 
disclosure and attribute disclosure. It is somewhat unclear how
identity disclosure should be defined for sliced data (or for
data anonymized by bucketization), since each tuple resides
within a bucket and within the bucket the associations across
different columns are hidden. In any case, because identity
disclosure leads to attribute disclosure, protection against 
attribute disclosure is also sufficient protection against iden- 
tity disclosure. 

We would like to point out a nice property of slicing that 
is important for privacy protection. In slicing, a tuple can 
potentially match multiple buckets, i.e., each tuple can have 
more than one matching bucket. This is different from
previous work on generalization and bucketization, where each
tuple belongs to a unique equivalence class (or bucket).
In fact, it has been recognized [1] that restricting a tuple in a 
unique bucket helps the adversary but does not improve data 
utility. We will see that allowing a tuple to match multiple
buckets is important for both attribute disclosure protection
and membership disclosure protection, when we describe them
in Section 3 and Section 5, respectively.

3. ATTRIBUTE DISCLOSURE PROTECTION

In this section, we show how slicing can be used to prevent
attribute disclosure, based on the privacy requirement of
ℓ-diversity, and introduce the notion of ℓ-diverse slicing.

3.1 Example 

We first give an example illustrating how slicing satisfies 
ℓ-diversity [23] where the sensitive attribute is "Disease".



The sliced table shown in Table 1(f) satisfies 2-diversity.
Consider tuple $t_1$ with QI values (22, M, 47906). In order
to determine $t_1$'s sensitive value, one has to examine $t_1$'s
matching buckets. By examining the first column (Age, Sex)
in Table 1(f), we know that $t_1$ cannot be in bucket $B_2$,
because there are no matches of (22, M) there, and must
therefore be in the first bucket $B_1$.

Then, by examining the Zipcode attribute of the second
column (Zipcode, Disease) in bucket $B_1$, we know that the
column value for $t_1$ must be either (47906, dyspepsia) or
(47906, flu) because they are the only values that match
$t_1$'s zipcode 47906. Note that the other two column values
have zipcode 47905. Without additional knowledge, both
dyspepsia and flu are equally likely to be the sensitive
value of $t_1$. Therefore, the probability of learning the
correct sensitive value of $t_1$ is bounded by 0.5. Similarly, we
can verify that 2-diversity is satisfied for all other tuples in
Table 1(f).

3.2 ℓ-Diverse Slicing

In the above example, tuple $t_1$ has only one matching
bucket. In general, a tuple t can have multiple matching
buckets. We now extend the above analysis to the general
case and introduce the notion of ℓ-diverse slicing.

Consider an adversary who knows all the QI values of t
and attempts to infer t's sensitive value from the sliced table.
The adversary first needs to determine which buckets t may
reside in, i.e., the set of matching buckets of t. Tuple t can be
in any one of its matching buckets. Let $p(t, B)$ be the
probability that t is in bucket B (the procedure for computing
$p(t, B)$ is described later in this section). For example, in the
above example, $p(t_1, B_1) = 1$ and $p(t_1, B_2) = 0$.

In the second step, the adversary computes $p(t, s)$, the
probability that t takes a sensitive value s. This probability is
calculated using the law of total probability. Specifically, let
$p(s \mid t, B)$ be the probability that t takes sensitive value s
given that t is in bucket B. Then the probability $p(t, s)$ is:

$$p(t, s) = \sum_{B} p(t, B)\, p(s \mid t, B) \qquad (1)$$



In the rest of this section, we show how to compute the
two probabilities: $p(t, B)$ and $p(s \mid t, B)$.

Computing $p(t, B)$. Given a tuple t and a sliced bucket
B, the probability that t is in B depends on the fraction
of t's column values that match the column values in B. If
some column value of t does not appear in the corresponding
column of B, it is certain that t is not in B. In general,
bucket B can potentially match $|B|^c$ tuples, where $|B|$ is
the number of tuples in B. Without additional knowledge,
one has to assume that the column values are independent;
therefore each of the $|B|^c$ tuples is equally likely to be an
original tuple. The probability that t is in B depends on the
fraction of the $|B|^c$ tuples that match t.

We formalize the above analysis. We consider the match
between t's column values $\{t[C_1], t[C_2], \ldots, t[C_c]\}$ and B's
column values $\{B[C_1], B[C_2], \ldots, B[C_c]\}$. Let $f_i(t, B)$
($1 \le i \le c - 1$) be the fraction of occurrences of $t[C_i]$ in
$B[C_i]$ and let $f_c(t, B)$ be the fraction of occurrences of
$t[C_c - \{S\}]$ in $B[C_c - \{S\}]$. Note that $C_c - \{S\}$ is the set of
QI attributes in the sensitive column. For example, in Table 1(f),
$f_1(t_1, B_1) = 1/4 = 0.25$ and $f_2(t_1, B_1) = 2/4 = 0.5$.
Similarly, $f_1(t_1, B_2) = 0$ and $f_2(t_1, B_2) = 0$. Intuitively,
$f_i(t, B)$ measures the matching degree on column $C_i$ between
tuple t and bucket B.

Because each possible candidate tuple is equally likely to
be an original tuple, the matching degree between t and B
is the product of the matching degrees on each column, i.e.,
$f(t, B) = \prod_{1 \le i \le c} f_i(t, B)$. Note that $\sum_t f(t, B) = 1$ and,
when B is not a matching bucket of t, $f(t, B) = 0$.

Tuple t may have multiple matching buckets; t's total
matching degree in the whole data is $f(t) = \sum_B f(t, B)$.
The probability that t is in bucket B is:

$$p(t, B) = \frac{f(t, B)}{f(t)}$$
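
The computation of $p(t, B)$ can be sketched in Python as follows, using exact fractions. The nested-list encoding of Table 1(f) and the helper names are assumptions made for illustration; the adversary's knowledge of t is given as its QI values split by column, so the last entry omits the sensitive value.

```python
from fractions import Fraction

# Table 1(f); the sensitive value is the last component of the last column.
SLICED = [
    [
        [(22, "M"), (22, "F"), (33, "F"), (52, "F")],
        [(47905, "flu"), (47906, "dyspepsia"), (47905, "bronchitis"), (47906, "flu")],
    ],
    [
        [(54, "M"), (60, "M"), (60, "M"), (64, "F")],
        [(47304, "gastritis"), (47302, "flu"), (47302, "dyspepsia"), (47304, "dyspepsia")],
    ],
]

def column_fractions(t_qi, bucket):
    """f_i(t, B) for every column; S is dropped from the sensitive column."""
    *qi_columns, sensitive_column = bucket
    stripped = [value[:-1] for value in sensitive_column]  # C_c - {S}
    return [
        Fraction(column.count(value), len(column))
        for value, column in zip(t_qi, qi_columns + [stripped])
    ]

def matching_degree(t_qi, bucket):
    """f(t, B): the product of the per-column fractions."""
    degree = Fraction(1)
    for fraction in column_fractions(t_qi, bucket):
        degree *= fraction
    return degree

def bucket_probabilities(t_qi, sliced):
    """p(t, B) = f(t, B) / f(t) for every bucket B."""
    degrees = [matching_degree(t_qi, bucket) for bucket in sliced]
    total = sum(degrees)  # f(t)
    if total == 0:        # t has no matching bucket
        return [Fraction(0) for _ in degrees]
    return [degree / total for degree in degrees]

t1_qi = [(22, "M"), (47906,)]
print(column_fractions(t1_qi, SLICED[0]))   # 1/4 and 1/2, as in the text
print(bucket_probabilities(t1_qi, SLICED))  # p(t1, B1) = 1, p(t1, B2) = 0
```

Using `Fraction` instead of floating point keeps the values identical to the hand computation above.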



Computing $p(s \mid t, B)$. Suppose that t is in bucket B.
To determine t's sensitive value, one needs to examine the
sensitive column of bucket B. Since the sensitive column 
contains the QI attributes, not all sensitive values can be 
t's sensitive value. Only those sensitive values whose QI 
values match t's QI values are t's candidate sensitive values. 
Without additional knowledge, all candidate sensitive values 
(including duplicates) in a bucket are equally possible. Let 
D(t, B) be the distribution of t's candidate sensitive values 
in bucket B. 

Definition 6 ($D(t, B)$). Any sensitive value that is
associated with $t[C_c - \{S\}]$ in B is a candidate sensitive
value for t (there are $f_c(t, B) \cdot |B|$ candidate sensitive values for
t in B, including duplicates). Let $D(t, B)$ be the distribution
of the candidate sensitive values in B and $D(t, B)[s]$ be the
probability of the sensitive value s in the distribution.

For example, in Table 1(f), $D(t_1, B_1)$ = (dyspepsia :
0.5, flu : 0.5) and therefore $D(t_1, B_1)[\text{dyspepsia}] = 0.5$. The
probability $p(s \mid t, B)$ is exactly $D(t, B)[s]$, i.e.,
$p(s \mid t, B) = D(t, B)[s]$.
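
The distribution $D(t, B)$ can be computed by filtering the sensitive column on t's QI values and counting what remains. The Python sketch below is illustrative; the function name `candidate_distribution` and the tuple encoding (QI values first, sensitive value last) are assumptions made here.

```python
from collections import Counter
from fractions import Fraction

# Sensitive column (Zipcode, Disease) of bucket B1 in Table 1(f).
B1_SENSITIVE_COLUMN = [
    (47905, "flu"),
    (47906, "dyspepsia"),
    (47905, "bronchitis"),
    (47906, "flu"),
]

def candidate_distribution(t_sensitive_qi, sensitive_column):
    """D(t, B): distribution of the sensitive values whose QI part matches t.

    t_sensitive_qi is t[C_c - {S}], e.g. (47906,) for t_1.
    """
    candidates = [
        value[-1] for value in sensitive_column if value[:-1] == t_sensitive_qi
    ]
    counts = Counter(candidates)  # duplicates are kept, as in Definition 6
    return {s: Fraction(n, len(candidates)) for s, n in counts.items()}

distribution = candidate_distribution((47906,), B1_SENSITIVE_COLUMN)
print(distribution)  # dyspepsia and flu, each with probability 1/2
```

If no value in the column matches t's QI values, the function returns an empty distribution, which corresponds to a bucket that does not match t.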

ℓ-Diverse Slicing. Once we have computed $p(t, B)$ and
$p(s \mid t, B)$, we are able to compute the probability $p(t, s)$
based on Equation (1). We can show that when t is in the data,
the probabilities that t takes a sensitive value sum up to 1.

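Fact 1. For any tuple $t \in D$, $\sum_s p(t, s) = 1$.
Proof.

$$\begin{aligned}
\sum_s p(t, s) &= \sum_s \sum_B p(t, B)\, p(s \mid t, B) \\
&= \sum_B p(t, B) \sum_s p(s \mid t, B) \\
&= \sum_B p(t, B) \\
&= 1 \qquad (2)
\end{aligned}$$

□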

ℓ-Diverse slicing is defined based on the probability $p(t, s)$.

Definition 7 (ℓ-diverse slicing). A tuple t satisfies
ℓ-diversity iff for any sensitive value s,

$$p(t, s) \le 1/\ell$$

A sliced table satisfies ℓ-diversity iff every tuple in it satisfies
ℓ-diversity.



Our analysis above directly shows that, from an ℓ-diverse
sliced table, an adversary cannot correctly learn the sensitive
value of any individual with a probability greater than 1/ℓ.
Note that once we have computed the probability that a
tuple takes a sensitive value, we can also use slicing for other
privacy measures such as t-closeness [20].
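
Putting the pieces together, the following Python sketch evaluates Equation (1) for a tuple and checks the condition of Definition 7 on the sliced table of Table 1(f). It is a sketch under stated assumptions rather than the paper's algorithm: the nested-list encoding, the helper names, and the list of QI value combinations are introduced here for illustration.

```python
from fractions import Fraction
from math import prod

SLICED = [  # Table 1(f); the sensitive value is the last component
    [
        [(22, "M"), (22, "F"), (33, "F"), (52, "F")],
        [(47905, "flu"), (47906, "dyspepsia"), (47905, "bronchitis"), (47906, "flu")],
    ],
    [
        [(54, "M"), (60, "M"), (60, "M"), (64, "F")],
        [(47304, "gastritis"), (47302, "flu"), (47302, "dyspepsia"), (47304, "dyspepsia")],
    ],
]

def matching_degree(t_qi, bucket):
    """f(t, B): product over columns of the fraction of matching values."""
    *qi_columns, sensitive_column = bucket
    columns = qi_columns + [[value[:-1] for value in sensitive_column]]
    return prod(Fraction(col.count(v), len(col)) for v, col in zip(t_qi, columns))

def sensitive_probabilities(t_qi, sliced):
    """p(t, s) = sum over B of p(t, B) * p(s | t, B)."""
    degrees = [matching_degree(t_qi, bucket) for bucket in sliced]
    total = sum(degrees)  # f(t)
    probabilities = {}
    for degree, bucket in zip(degrees, sliced):
        if degree == 0:
            continue  # not a matching bucket, so p(t, B) = 0
        candidates = [v[-1] for v in bucket[-1] if v[:-1] == t_qi[-1]]
        for s in candidates:  # each candidate carries weight 1/len(candidates)
            share = (degree / total) / len(candidates)
            probabilities[s] = probabilities.get(s, 0) + share
    return probabilities

def satisfies_l_diversity(t_qi, sliced, l):
    """True iff p(t, s) <= 1/l for every sensitive value s."""
    probabilities = sensitive_probabilities(t_qi, sliced)
    return all(p <= Fraction(1, l) for p in probabilities.values())

# QI values of the eight original tuples, split by column.
QI_TUPLES = [
    [(22, "M"), (47906,)], [(22, "F"), (47906,)],
    [(33, "F"), (47905,)], [(52, "F"), (47905,)],
    [(54, "M"), (47302,)], [(60, "M"), (47302,)],
    [(60, "M"), (47304,)], [(64, "F"), (47304,)],
]

print(sensitive_probabilities(QI_TUPLES[0], SLICED))
print(all(satisfies_l_diversity(t, SLICED, 2) for t in QI_TUPLES))
print(all(satisfies_l_diversity(t, SLICED, 3) for t in QI_TUPLES))
```

For this table every tuple has exactly two equally likely candidate sensitive values, so the 2-diversity check succeeds and the 3-diversity check fails. A tuple with no matching bucket yields an empty dictionary and passes the check vacuously, which a real implementation would probably want to treat separately.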

4. SLICING ALGORITHMS 

We now present an efficient slicing algorithm to achieve
ℓ-diverse slicing. Given a microdata table T and two
parameters c and ℓ, the algorithm computes the sliced table that
consists of c columns and satisfies the privacy requirement
of ℓ-diversity.

Our algorithm consists of three phases: attribute parti- 
tioning, column generalization, and tuple partitioning. We 
now describe the three phases. 

4.1 Attribute Partitioning 

Our algorithm partitions attributes so that highly- 
correlated attributes are in the same column. This is good 
for both utility and privacy. In terms of data utility, group- 
ing highly-correlated attributes preserves the correlations 
among those attributes. In terms of privacy, the association 
of uncorrelated attributes presents higher identification risks 
than the association of highly-correlated attributes because 
the association of uncorrelated attribute values is much less 
frequent and thus more identifiable. Therefore, it is better 
to break the associations between uncorrelated attributes, 
in order to protect privacy. 

In this phase, we first compute the correlations between 
pairs of attributes and then cluster attributes based on their 
correlations. 

4.1.1 Measures of Correlation 

Two widely-used measures of association are Pearson cor- 
relation coefficient [6] and mean-square contingency coeffi- 
cient [B]. Pearson correlation coefficient is used for mea- 
suring correlations between two continuous attributes while 
mean-square contingency coefficient is a chi-square mea- 
sure of correlation between two categorical attributes. We 
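choose to use the mean-square contingency coefficient
because most of our attributes are categorical. Consider two
attributes $A_1$ and $A_2$ with domains
$\{v_{11}, v_{12}, \ldots, v_{1d_1}\}$ and
$\{v_{21}, v_{22}, \ldots, v_{2d_2}\}$, respectively. Their domain
sizes are thus $d_1$ and $d_2$, respectively. The mean-square
contingency coefficient between $A_1$ and $A_2$ is defined as:

$$\phi^2(A_1, A_2) = \frac{1}{\min\{d_1, d_2\} - 1} \sum_{i=1}^{d_1} \sum_{j=1}^{d_2} \frac{(f_{ij} - f_{i\cdot} f_{\cdot j})^2}{f_{i\cdot} f_{\cdot j}}$$

Here, $f_{i\cdot}$ and $f_{\cdot j}$ are the fractions of occurrences
of $v_{1i}$ and $v_{2j}$ in the data, respectively, and $f_{ij}$ is the
fraction of co-occurrences of $v_{1i}$ and $v_{2j}$ in the data.
Therefore, $f_{i\cdot}$ and $f_{\cdot j}$ are the marginal totals of
$f_{ij}$: $f_{i\cdot} = \sum_{j=1}^{d_2} f_{ij}$ and
$f_{\cdot j} = \sum_{i=1}^{d_1} f_{ij}$. It can be shown that
$0 \le \phi^2(A_1, A_2) \le 1$.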
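
A minimal sketch of this measure is given below. It assumes the usual
normalized form of the mean-square contingency coefficient (the chi-square
style sum over all value pairs, divided by min{d1, d2} − 1); the function name
and the toy inputs are illustrative assumptions.

```python
from collections import Counter


def mean_square_contingency(xs, ys):
    """Mean-square contingency coefficient of two categorical value lists."""
    n = len(xs)
    fx = Counter(xs)            # occurrences of each value of A1
    fy = Counter(ys)            # occurrences of each value of A2
    fxy = Counter(zip(xs, ys))  # co-occurrences of value pairs
    d = min(len(fx), len(fy))
    if d < 2:
        return 0.0              # a constant attribute carries no association
    total = 0.0
    for x, cx in fx.items():
        for y, cy in fy.items():
            expected = (cx / n) * (cy / n)     # f_i. * f_.j
            observed = fxy.get((x, y), 0) / n  # f_ij
            total += (observed - expected) ** 2 / expected
    return total / (d - 1)


# Perfectly associated attributes give 1, independent ones give 0.
print(mean_square_contingency(list("abab"), list("xyxy")))
print(mean_square_contingency(list("aabb"), list("xyxy")))
```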

For continuous attributes, we first apply discretization to 
partition the domain of a continuous attribute into intervals 
and then treat the collection of interval values as a discrete 
domain. Discretization has been frequently used for decision
tree classification, summarization, and frequent itemset
mining. We use equal-width discretization, which partitions an
attribute domain into k equal-sized intervals (for some k).
Other methods for handling continuous attributes are the
subjects of future work.
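
Equal-width discretization can be sketched as follows. The function name and
the choice to place the maximum value in the last interval are assumptions
made for this illustration.

```python
def equal_width_bins(values, k):
    """Map each numeric value to the index (0 .. k-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # degenerate domain: a single interval
    width = (hi - lo) / k
    # The maximum value would fall into interval k, so clamp it to k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


print(equal_width_bins([0, 5, 10, 15, 20], 4))
```

The resulting interval indices can then be treated as categorical values when
computing the correlation measure.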

4.1.2 Attribute Clustering 

Having computed the correlations for each pair of
attributes, we use clustering to partition attributes into
columns. In our algorithm, each attribute is a point in the
clustering space. The distance between two attributes in the
clustering space is defined as
$d(A_1, A_2) = 1 - \phi^2(A_1, A_2)$, which is between 0 and 1.
Two attributes that are strongly correlated will have a smaller
distance between the corresponding data points in our
clustering space.

We choose the k-medoid method for the following
reasons. First, many existing clustering algorithms (e.g.,
k-means) require the calculation of "centroids", but there
is no notion of "centroids" in our setting where each attribute
forms a data point in the clustering space. Second, the k-medoid
method is very robust to the existence of outliers (i.e., data
points that are very far away from the rest of the data points).
Third, the order in which the data points are examined does
not affect the clusters computed from the k-medoid method.
We use the well-known k-medoid algorithm PAM (Partition
Around Medoids) [14]. PAM starts with an arbitrary selection
of k data points as the initial medoids. In each subsequent
step, PAM chooses one medoid point and one non-medoid
point and swaps them as long as the cost of clustering
decreases. Here, the clustering cost is measured as the sum
of the cost of each cluster, which is in turn measured as the
sum of the distance from each data point in the cluster to the
medoid point of the cluster. The time complexity of PAM
is O(k(n − k)^2). Thus, PAM is known to suffer from
high computational complexity for large datasets. However,
the data points in our clustering space are attributes, rather
than tuples in the microdata. Therefore, PAM will not have
computational problems for clustering attributes.
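
The swap procedure described above can be sketched on a precomputed distance
matrix. This is a simplified illustration of PAM, not the reference
implementation: the function name, the choice of the first k points as initial
medoids, and the toy distances are assumptions.

```python
def pam(dist, k):
    """Cluster points 0..n-1 around k medoids, given a distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # arbitrary initial selection of k medoids

    def cost(meds):
        # Sum of distances from every point to its closest medoid.
        return sum(min(dist[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < best - 1e-12:  # keep the swap only if cost drops
                    medoids, best, improved = trial, trial_cost, True

    clusters = {m: [] for m in medoids}
    for i in range(n):
        clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
    return clusters


# Four hypothetical attributes: 0 and 1 are strongly correlated, as are 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
print(sorted(pam(dist, 2).values()))
```

Each resulting cluster of attribute indices corresponds to one column of the
sliced table.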

4.1.3 Special Attribute Partitioning 

In the above procedure, all attributes (including both QIs
and SAs) are clustered into columns. The k-medoid method
ensures that the attributes are clustered into k columns but
does not have any guarantee on the size of the sensitive
column C_c. In some cases, we may pre-determine the number of
attributes in the sensitive column to be a. The parameter a
determines the size of the sensitive column C_c, i.e., |C_c| = a.
If a = 1, then |C_c| = 1, which means that C_c = {S}. And
when c = 2, slicing in this case becomes equivalent to
bucketization. If a > 1, then |C_c| > 1 and the sensitive column
also contains some QI attributes.

We adapt the above algorithm to partition attributes into
c columns such that the sensitive column C_c contains exactly a
attributes. We first calculate correlations between the
sensitive attribute S and each QI attribute. Then, we rank the
QI attributes by the decreasing order of their correlations
with S and select the top a − 1 QI attributes. Now, the
sensitive column C_c consists of S and the selected QI attributes.
All other QI attributes form the other c − 1 columns using
the attribute clustering algorithm.
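
The selection of the sensitive column can be sketched as below. The function
name and the correlation scores in the example are hypothetical and serve only
to illustrate the ranking step.

```python
def partition_with_sensitive_column(corr_with_s, a, sensitive="S"):
    """Build the sensitive column from the a - 1 QI attributes most correlated with S.

    corr_with_s -- dict {QI attribute name: correlation with S}
    Returns (sensitive_column, remaining_qi_attributes).
    """
    ranked = sorted(corr_with_s, key=corr_with_s.get, reverse=True)
    chosen = ranked[:a - 1]     # top a - 1 QI attributes
    remaining = ranked[a - 1:]  # to be clustered into the other c - 1 columns
    return chosen + [sensitive], remaining


# Hypothetical correlation scores with the sensitive attribute.
scores = {"Age": 0.10, "Gender": 0.30, "Education": 0.20}
print(partition_with_sensitive_column(scores, 2))
```

With a = 1 the sensitive column contains only S, which is the bucketization
case when c = 2.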

4.2 Column Generalization 

In the second phase, tuples are generalized to satisfy some
minimal frequency requirement. We want to point out that
column generalization is not an indispensable phase in our
algorithm. As shown by Xiao and Tao [35], bucketization
provides the same level of privacy protection as
generalization, with respect to attribute disclosure.

Algorithm tuple-partition(T, ℓ)

1. Q = {T}; SB = ∅.
2. while Q is not empty
3.     remove the first bucket B from Q; Q = Q − {B}.
4.     split B into two buckets B1 and B2, as in Mondrian.
5.     if diversity-check(T, Q ∪ {B1, B2} ∪ SB, ℓ)
6.         Q = Q ∪ {B1, B2}.
7.     else SB = SB ∪ {B}.
8. return SB.

Figure 1: The tuple-partition algorithm

Although column generalization is not a required phase,
it can be useful in several aspects. First, column
generalization may be required for identity/membership disclosure
protection. If a column value is unique in a column (i.e.,
the column value appears only once in the column), a tuple
with this unique column value can only have one matching
bucket. This is not good for privacy protection, as in the case
of generalization/bucketization where each tuple can belong
to only one equivalence class/bucket. The main problem is
that this unique column value can be identifying. In this
case, it would be useful to apply column generalization to
ensure that each column value appears with at least some
frequency.

Second, when column generalization is applied, to achieve
the same level of privacy against attribute disclosure, bucket
sizes can be smaller (see Section 4.3). While column
generalization may result in information loss, smaller bucket sizes
allow better data utility. Therefore, there is a trade-off
between column generalization and tuple partitioning. In this
paper, we mainly focus on the tuple partitioning algorithm.
The trade-off between column generalization and tuple
partitioning is the subject of future work. Existing
anonymization algorithms can be used for column generalization, e.g.,
Mondrian [17]. The algorithms can be applied on the
sub-table containing only attributes in one column to ensure the
anonymity requirement.

4.3 Tuple Partitioning 

In the tuple partitioning phase, tuples are partitioned into
buckets. We modify the Mondrian [17] algorithm for tuple
partitioning. Unlike Mondrian k-anonymity, no generalization
is applied to the tuples; we use Mondrian for the purpose of
partitioning tuples into buckets.

Figure 1 gives the description of the tuple-partition
algorithm. The algorithm maintains two data structures: (1)
a queue of buckets Q and (2) a set of sliced buckets SB.
Initially, Q contains only one bucket which includes all
tuples and SB is empty (line 1). In each iteration (line 2 to
line 7), the algorithm removes a bucket from Q and splits
the bucket into two buckets (the split criterion is described
in Mondrian [17]). If the sliced table after the split satisfies
ℓ-diversity (line 5), then the algorithm puts the two buckets
at the end of the queue Q (for more splits, line 6).
Otherwise, we cannot split the bucket anymore and the algorithm
puts the bucket into SB (line 7). When Q becomes empty,
we have computed the sliced table. The set of sliced buckets
is SB (line 8).
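
The queue-based control flow of tuple-partition can be sketched as follows.
This is a minimal sketch under stated assumptions: `median_split` is a
simplified stand-in for the Mondrian split (it merely sorts the bucket and cuts
it in half), and the privacy test is passed in as a callback, so the minimum
bucket size used in the example is only a placeholder for the real
diversity-check.

```python
from collections import deque


def median_split(bucket):
    """Simplified stand-in for a Mondrian split: sort and cut at the middle."""
    if len(bucket) < 2:
        return None  # nothing left to split
    ordered = sorted(bucket)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


def tuple_partition(table, split, diversity_ok):
    """Split buckets while the resulting set of buckets passes diversity_ok."""
    queue = deque([list(table)])  # line 1: Q = {T}
    sliced = []                   # line 1: SB is empty
    while queue:                  # line 2
        bucket = queue.popleft()  # line 3
        halves = split(bucket)    # line 4
        if halves is not None:
            b1, b2 = halves
            if diversity_ok(list(queue) + [b1, b2] + sliced):  # line 5
                queue.extend([b1, b2])                         # line 6
                continue
        sliced.append(bucket)     # line 7
    return sliced                 # line 8


# Placeholder privacy test: every bucket must keep at least two tuples.
table = [(i,) for i in range(8)]
buckets = tuple_partition(table, median_split,
                          lambda bs: all(len(b) >= 2 for b in bs))
print(buckets)
```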

The main part of the tuple-partition algorithm is to check
whether a sliced table satisfies ℓ-diversity (line 5). Figure 2
gives a description of the diversity-check algorithm. For each
tuple t, the algorithm maintains a list of statistics L[t] about
t's matching buckets. Each element in the list L[t] contains
statistics about one matching bucket B: the matching
probability p(t, B) and the distribution of candidate sensitive
values D(t, B).

Algorithm diversity-check(T, T*, ℓ)

1. for each tuple t ∈ T, L[t] = ∅.
2. for each bucket B in T*
3.     record f(v) for each column value v in bucket B.
4.     for each tuple t ∈ T
5.         calculate p(t, B) and find D(t, B).
6.         L[t] = L[t] ∪ {⟨p(t, B), D(t, B)⟩}.
7. for each tuple t ∈ T
8.     calculate p(t, s) for each s based on L[t].
9.     if p(t, s) > 1/ℓ, return false.
10. return true.

Figure 2: The diversity-check algorithm

The algorithm first takes one scan of each bucket B (line 2
to line 3) to record the frequency f(v) of each column value
v in bucket B. Then the algorithm takes one scan of each
tuple t in the table T (line 4 to line 6) to find out all tuples
that match B and record their matching probability p(t, B)
and the distribution of candidate sensitive values D(t, B),
which are added to the list L[t] (line 6). At the end of line
6, we have obtained, for each tuple t, the list of statistics
L[t] about its matching buckets. A final scan of the tuples
in T will compute the p(t, s) values based on the law of total
probability described in Section 3.2. Specifically,

$$p(t, s) = \sum_{e \in L[t]} e.p(t, B) \cdot e.D(t, B)[s]$$

The sliced table is ℓ-diverse iff for every tuple t and every
sensitive value s, p(t, s) ≤ 1/ℓ (line 7 to line 10).

We now analyze the time complexity of the tuple-partition
algorithm. The time complexity of Mondrian [17] or
kd-tree [10] is O(n log n) because at each level of the kd-tree,
the whole dataset needs to be scanned, which takes O(n) time,
and the height of the tree is O(log n). In our modification,
each level takes O(n^2) time because of the diversity-check
algorithm (note that the number of buckets is at most n).
The total time complexity is therefore O(n^2 log n).

5. MEMBERSHIP DISCLOSURE PROTECTION

Let us first examine how an adversary can infer member- 
ship information from bucketization. Because bucketization 
releases the QI values in their original form and most indi- 
viduals can be uniquely identified using the QI values, the 
adversary can simply determine the membership of an in- 
dividual in the original data by examining the frequency of 
the QI values in the bucketized data. Specifically, if the fre- 
quency is 0, the adversary knows for sure that the individual 
is not in the data. If the frequency is greater than 0, the 
adversary knows with high confidence that the individual 
is in the data, because this matching tuple must belong to 
that individual as almost no other individual has the same 
QI values. 

The above reasoning suggests that in order to protect
membership information, it is required that, in the
anonymized data, a tuple in the original data should have
a similar frequency as a tuple that is not in the original
data. Otherwise, by examining their frequencies in the
anonymized data, the adversary can differentiate tuples in
the original data from tuples not in the original data.

We now show how slicing protects against membership
disclosure. Let D be the set of tuples in the original data
and let D̄ be the set of tuples that are not in the original
data. Let D^s be the sliced data. Given D^s and a tuple t, the
goal of membership disclosure is to determine whether t ∈ D
or t ∈ D̄. In order to distinguish tuples in D from tuples in
D̄, we examine their differences. If t ∈ D, t must have at
least one matching bucket in D^s. To protect membership
information, we must ensure that at least some tuples in D̄
also have matching buckets. Otherwise, the adversary can
differentiate between t ∈ D and t ∈ D̄ by examining
the number of matching buckets.

We call a tuple an original tuple if it is in D. We call a
tuple a fake tuple if it is in D̄ and it matches at least one
bucket in the sliced data. We consider two measures for
membership disclosure protection. The first measure is the
number of fake tuples. When the number of fake tuples is 0
(as in bucketization), the membership information of every
tuple can be determined. The second measure compares the
number of matching buckets for original tuples with that for
fake tuples. If they are similar enough, membership
information is protected because the adversary cannot
distinguish original tuples from fake tuples.

Slicing is an effective technique for membership disclosure
protection. A sliced bucket of size k can potentially match
k^c tuples. Besides the original k tuples, this bucket can
introduce as many as k^c − k tuples in D̄, which is k^(c−1) − 1
times the number of original tuples. The existence of
such tuples in D̄ hides the membership information
of tuples in D, because when the adversary finds a matching
bucket, she or he is not certain whether this tuple is in D or
not since a large number of tuples in D̄ have matching
buckets as well. In our experiments (Section 6), we empirically
evaluate slicing in membership disclosure protection.
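
The k^c − k bound can be checked on a toy bucket by enumerating every
combination of column values. In this sketch a bucket is a list of k tuples
whose entries are the c column values; the function name and the data are
assumptions, and only this single bucket is considered (a combination that is
fake here could be an original tuple elsewhere in a real table).

```python
from itertools import product


def fake_tuples(bucket):
    """Combinations of column values that match the bucket but are not in it."""
    c = len(bucket[0])
    column_values = [{t[i] for t in bucket} for i in range(c)]
    matching = set(product(*column_values))  # at most k**c combinations
    return matching - set(bucket)


# k = 3 tuples, c = 2 columns, all column values distinct.
bucket = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
print(len(fake_tuples(bucket)))  # equals 3**2 - 3
```

For k = 100 and c = 2 the same bound gives 100^2 − 100 = 9900 potential fake
tuples, i.e., 99 times the number of original tuples in the bucket.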

6. EXPERIMENTS 

We conduct two experiments. In the first experiment, we
evaluate the effectiveness of slicing in preserving data utility
and protecting against attribute disclosure, as compared to
generalization and bucketization. To allow direct
comparison, we use the Mondrian algorithm [17] and ℓ-diversity for
all three anonymization techniques: generalization,
bucketization, and slicing. This experiment demonstrates that:
(1) slicing preserves better data utility than generalization;
(2) slicing is more effective than bucketization in workloads
involving the sensitive attribute; and (3) the sliced table
can be computed efficiently. Results for this experiment are
presented in Section 6.2.

In the second experiment, we show the effectiveness of
slicing in membership disclosure protection. For this
purpose, we count the number of fake tuples in the sliced data.
We also compare the number of matching buckets for
original tuples and that for fake tuples. Our experiment results
show that bucketization does not prevent membership
disclosure as almost every tuple is uniquely identifiable in the
bucketized data. Slicing provides better protection against
membership disclosure: (1) the number of fake tuples in the
sliced data is very large, as compared to the number of
original tuples and (2) the number of matching buckets for fake
tuples and that for original tuples are close enough, which
makes it difficult for the adversary to distinguish fake
tuples from original tuples. Results for this experiment are
presented in Section 6.3.

     Attribute         Type          # of values
  1  Age               Continuous    74
  2  Workclass         Categorical   8
  3  Final-Weight      Continuous    NA
  4  Education         Categorical   16
  5  Education-Num     Continuous    16
  6  Marital-Status    Categorical   7
  7  Occupation        Categorical   14
  8  Relationship      Categorical   6
  9  Race              Categorical   5
 10  Sex               Categorical   2
 11  Capital-Gain      Continuous    NA
 12  Capital-Loss      Continuous    NA
 13  Hours-Per-Week    Continuous    NA
 14  Country           Categorical   41
 15  Salary            Categorical   2

Table 2: Description of the Adult dataset

Experimental Data. We use the Adult dataset from the
UC Irvine machine learning repository [2], which is
comprised of data collected from the US census. The dataset is
described in Table 2. Tuples with missing values are
eliminated and there are 45222 valid tuples in total. The Adult
dataset contains 15 attributes in total.

In our experiments, we obtain two datasets from the Adult
dataset. The first dataset is the "OCC-7" dataset, which
includes 7 attributes: QI = {Age, Workclass, Education,
Marital-Status, Race, Sex} and S = Occupation. The
second dataset is the "OCC-15" dataset, which includes all
15 attributes and the sensitive attribute is S = Occupation.

In the "OCC-7" dataset, the attribute that has the closest 
correlation with the sensitive attribute Occupation is Gen- 
der, with the next closest attribute being Education. In the 
"OCC-15" dataset, the closest attribute is also Gender but 
the next closest attribute is Salary. 

6.1 Preprocessing 

Some preprocessing steps must be applied on the 
anonymized data before it can be used for workload tasks. 
First, the anonymized table computed through generaliza- 
tion contains generalized values, which need to be trans- 
formed to some form that can be understood by the classi- 
fication algorithm. Second, the anonymized table computed 
by bucketization or slicing contains multiple columns, the 
linking between which is broken. We need to process such 
data before workload experiments can run on the data. 

[Figure 3: Learning the sensitive attribute (Target:
Occupation). Panels include (b) Naive Bayes (OCC-7),
(c) J48 (OCC-15), and (d) Naive Bayes (OCC-15).]

Handling generalized values. In this step, we map the
generalized values (set/interval) to data points. Note that
the Mondrian algorithm assumes a total order on the
domain values of each attribute and each generalized value is a
sub-sequence of the totally ordered domain values. There are
several approaches to handle generalized values. The first
approach is to replace a generalized value with the mean
value of the generalized set. For example, the generalized
age [20,54] will be replaced by age 37 and the generalized
Education level {9th, 10th, 11th} will be replaced by 10th.
The second approach is to replace a generalized value by
its lower bound and upper bound. In this approach, each
attribute is replaced by two attributes, doubling the total
number of attributes. For example, the Education attribute
is replaced by two attributes Lower-Education and
Upper-Education; for the generalized Education level {9th, 10th,
11th}, the Lower-Education value would be 9th and the
Upper-Education value would be 11th. For simplicity, we
use the second approach in our experiments.
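
Both mappings can be sketched in a few lines. The function names and the
explicit ordered-domain list are assumptions made for illustration; the
examples reuse the values from the paragraph above.

```python
def to_bounds(generalized, ordered_domain):
    """Second approach: replace a generalized set by its lower and upper bound."""
    ranks = sorted(ordered_domain.index(v) for v in generalized)
    return ordered_domain[ranks[0]], ordered_domain[ranks[-1]]


def interval_mean(lower, upper):
    """First approach, for numeric intervals: replace [lower, upper] by its mean."""
    return (lower + upper) / 2


education_order = ["9th", "10th", "11th", "12th"]  # assumed total order
print(to_bounds({"9th", "10th", "11th"}, education_order))
print(interval_mean(20, 54))
```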

Handling bucketized/sliced data. In both
bucketization and slicing, attributes are partitioned into two or more
columns. For a bucket that contains k tuples and c columns,
we generate k tuples as follows. We first randomly
permute the values in each column. Then, we generate the i-th
(1 ≤ i ≤ k) tuple by linking the i-th value in each column.
We apply this procedure to all buckets and generate all of
the tuples from the bucketized/sliced table. This procedure
generates the linking between the columns in a random
fashion. In all of our classification experiments, we apply
this procedure 5 times and the average results are reported.
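
The random linking step for one bucket can be sketched as follows. The
function name, the representation of a bucket as a list of columns (each a
list of value tuples), and the example data are assumptions.

```python
import random


def link_columns(bucket_columns, rng):
    """Permute each column independently, then join the i-th values into tuple i."""
    shuffled = [rng.sample(col, len(col)) for col in bucket_columns]
    # zip(*shuffled) pairs up the i-th entries; sum(..., ()) concatenates them.
    return [sum(parts, ()) for parts in zip(*shuffled)]


# Hypothetical bucket with columns {Age, Sex} and {Zipcode, Disease}.
columns = [
    [(22, "M"), (22, "F"), (33, "F")],
    [("47906", "flu"), ("47905", "flu"), ("47905", "bronchitis")],
]
for row in link_columns(columns, random.Random(0)):
    print(row)
```

Every column keeps exactly the same multiset of values; only the association
between columns is randomized.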

6.2 Attribute Disclosure Protection 

We compare slicing with generalization and bucketization
on data utility of the anonymized data for classifier
learning. For all three techniques, we employ the Mondrian
algorithm [17] to compute the ℓ-diverse tables. The ℓ value can
take values {5, 8, 10} (note that the Occupation attribute has
14 distinct values). In this experiment, we choose a = 2.
Therefore, the sensitive column is always {Gender,
Occupation}.

Classifier learning. We evaluate the quality of the
anonymized data for classifier learning, which has been used
in [11] [18] [4]. We use the Weka software package to evaluate
the classification accuracy for Decision Tree C4.5 (J48) and
Naive Bayes. Default settings are used in both tasks. For all
classification experiments, we use 10-fold cross-validation.
In our experiments, we choose one attribute as the
target attribute (the attribute on which the classifier is built)
and all other attributes serve as the predictor attributes.
We consider the performances of the anonymization
algorithms in both learning the sensitive attribute Occupation
and learning a QI attribute Education.

[Figure 4: Learning a QI attribute (Target: Education)]

Learning the sensitive attribute. In this experiment,
we build a classifier on the sensitive attribute, which is
"Occupation". We fix c = 2 here and evaluate the effects of c
later in this section. Figure 3 compares the quality of the
anonymized data (generated by the three techniques) with
the quality of the original data, when the target attribute
is Occupation. The experiments are performed on the two
datasets OCC-7 (with 7 attributes) and OCC-15 (with 15
attributes).

In all experiments, slicing outperforms both generalization
and bucketization, which confirms that slicing preserves
attribute correlations between the sensitive attribute and some
QIs (recall that the sensitive column is {Gender,
Occupation}). Another observation is that bucketization performs
even slightly worse than generalization. That is mostly due
to our preprocessing step that randomly associates the
sensitive values to the QI values in each bucket. This may
introduce false associations while in generalization, the
associations are always correct although the exact associations
are hidden. A final observation is that when ℓ increases, the
performances of generalization and bucketization deteriorate
much faster than slicing. This also confirms that slicing
preserves better data utility in workloads involving the sensitive
attribute.

Learning a QI attribute. In this experiment, we build a
classifier on the QI attribute "Education". We fix c = 2 here
and evaluate the effects of c later in this section. Figure 4
shows the experiment results.

In all experiments, both bucketization and slicing
perform much better than generalization. This is because in
both bucketization and slicing, the QI attribute Education
is in the same column with many other QI attributes: in
bucketization, all QI attributes are in the same column; in
slicing, all QI attributes except Gender are in the same
column. This fact allows both approaches to perform well in
workloads involving the QI attributes. Note that the
classification accuracies of bucketization and slicing are lower
than that of the original data. This is because the sensitive
attribute Occupation is closely correlated with the target
attribute Education (as mentioned earlier in Section 6,
Education is the second closest attribute with Occupation in
OCC-7). By breaking the link between Education and
Occupation, classification accuracy on Education reduces for
both bucketization and slicing.

[Figure 5: Varied c values]

The effects of c. In this experiment, we evaluate the
effect of c on classification accuracy. We fix ℓ = 5 and vary
the number of columns c in {2, 3, 5}. Figure 5(a) shows the
results on learning the sensitive attribute and Figure 5(b)
shows the results on learning a QI attribute. It can be seen
that classification accuracy decreases only slightly when we
increase c, because the most correlated attributes are still
in the same column. In all cases, slicing shows better
accuracy than generalization. When the target attribute is the
sensitive attribute, slicing even performs better than
bucketization.

6.3 Membership Disclosure Protection 

In the second experiment, we evaluate the effectiveness of 
slicing in membership disclosure protection. 

We first show that bucketization is vulnerable to member- 
ship disclosure. In both the OCC-7 dataset and the OCC-15 
dataset, each combination of QI values occurs exactly once. 
This means that the adversary can determine the member- 
ship information of any individual by checking if the QI value 
appears in the bucketized data. If the QI value does not ap- 
pear in the bucketized data, the individual is not in the orig- 
inal data. Otherwise, with high confidence, the individual is 
in the original data as no other individual has the same QI 
value. 

We then show that slicing does prevent membership
disclosure. We perform the following experiment. First, we
partition attributes into c columns based on attribute
correlations. We set c ∈ {2, 5}. In other words, we
compare 2-column-slicing with 5-column-slicing. For example,
when we set c = 5, we obtain 5 columns. In OCC-7,
{Age, Marriage, Gender} is one column and each other
attribute is in its own column. In OCC-15, the 5 columns are:
{Age, Workclass, Education, Education-Num, Cap-Gain,
Hours, Salary}, {Marriage, Occupation, Family, Gender},
{Race, Country}, {Final-Weight}, and {Cap-Loss}.

[Figure 6: Number of fake tuples]

Then, we randomly partition tuples into buckets of size p
(the last bucket may have fewer than p tuples). As described
in Section 5, we collect statistics about the following two
measures in our experiments: (1) the number of fake tuples
and (2) the number of matching buckets for original tuples
vs. the number of matching buckets for fake tuples.
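
The random grouping and the matching-bucket count can be sketched as below.
The sketch assumes that a tuple matches a bucket when, for every column, its
column value occurs somewhere in that column of the bucket; the function names
and the toy data are illustrative assumptions.

```python
import random


def random_buckets(tuples, p, rng):
    """Randomly partition tuples into buckets of size p (the last may be smaller)."""
    shuffled = rng.sample(tuples, len(tuples))
    return [shuffled[i:i + p] for i in range(0, len(shuffled), p)]


def count_matching_buckets(t, buckets, columns):
    """Number of buckets in which every column contains t's value for that column."""
    def project(r, attrs):
        return tuple(r[a] for a in attrs)

    count = 0
    for bucket in buckets:
        if all(any(project(r, col) == project(t, col) for r in bucket)
               for col in columns):
            count += 1
    return count


columns = [[0], [1]]  # two single-attribute columns
buckets = [[(1, "a"), (2, "b")], [(1, "c"), (3, "a")]]
print(count_matching_buckets((1, "a"), buckets, columns))  # original tuple
print(count_matching_buckets((2, "a"), buckets, columns))  # fake tuple
print(random_buckets(list(range(5)), 2, random.Random(1)))
```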

The number of fake tuples. Figure 6 shows the
experimental results on the number of fake tuples, with respect to
the bucket size p. Our results show that the number of fake
tuples is large enough to hide the original tuples. For
example, for the OCC-7 dataset, even for a small bucket size of
100 and only 2 columns, slicing introduces as many as 87936
fake tuples, which is nearly twice the number of original
tuples (45222). When we increase the bucket size, the number
of fake tuples becomes larger. This is consistent with our
analysis that a bucket of size k can potentially match k^c − k
fake tuples. In particular, when we increase the number of
columns c, the number of fake tuples becomes exponentially
larger. In almost all experiments, the number of fake tuples
is larger than the number of original tuples. The existence
of such a large number of fake tuples provides protection for
membership information of the original tuples.

The number of matching buckets. Figure 7 shows
the number of matching buckets for original tuples and fake
tuples.

We categorize the tuples (both original tuples and fake
tuples) into three categories: (1) ≤ 10: tuples that have at
most 10 matching buckets, (2) 10−20: tuples that have more
than 10 matching buckets but at most 20 matching buckets,
and (3) > 20: tuples that have more than 20 matching
buckets. For example, the "original-tuples(≤ 10)" bar gives the
number of original tuples that have at most 10 matching
buckets and the "fake-tuples(> 20)" bar gives the number of
fake tuples that have more than 20 matching buckets.
Because the number of fake tuples that have at most 10
matching buckets is very large, we omit the "fake-tuples(≤ 10)" bar
from the figures to make the figures more readable.

Our results show that, even when we do random grouping,
many fake tuples have a large number of matching buckets.
For example, for the OCC-7 dataset, for a small p = 100
and c = 2, there are 5325 fake tuples that have more than
20 matching buckets; the number is 31452 for original tuples.
The numbers are even closer for larger p and c values. This
means that a larger bucket size and more columns provide
better protection against membership disclosure.

Although many fake tuples have a large number of
matching buckets, in general, original tuples have more matching
buckets than fake tuples. As we can see from the figures, a
large fraction of original tuples have more than 20 matching
buckets while only a small fraction of fake tuples have more
than 20 matching buckets. This is mainly due to the fact that
we use random grouping in the experiments. The results of random
grouping are that the number of fake tuples is very large but
most fake tuples have very few matching buckets. When we
aim at protecting membership information, we can design
more effective grouping algorithms to ensure better
protection against membership disclosure. The design of tuple
grouping algorithms is left to future work.

[Figure 7: Number of tuples that have matching buckets.
Panels: (a) 2-column (OCC-7), (b) 5-column (OCC-7),
(c) 2-column (OCC-15), (d) 5-column (OCC-15). x-axis: p value
(10, 100, 500, 1000); y-axis: number of tuples. Bars:
original-tuples(<=10), original-tuples(10-20),
original-tuples(>20), faked-tuples(10-20), faked-tuples(>20).]

7. RELATED WORK 

Two popular anonymization techniques are generalization
and bucketization. Generalization [29] [31] [30] replaces a
value with a "less-specific but semantically consistent" value.
Three types of encoding schemes have been proposed for
generalization: global recoding, regional recoding, and local
recoding. Global recoding has the property that multiple
occurrences of the same value are always replaced by the
same generalized value. Regional recoding [17] is also called
multi-dimensional recoding (the Mondrian algorithm), which
partitions the domain space into non-intersecting regions;
data points in the same region are represented by the region
they are in. Local recoding does not have the above
constraints and allows different occurrences of the same value
to be generalized differently.

Bucketization [35] [25] [16] first partitions tuples in the
table into buckets and then separates the quasi-identifiers
from the sensitive attribute by randomly permuting the
sensitive attribute values in each bucket. The anonymized
data consists of a set of buckets with permuted sensitive
attribute values. In particular, bucketization has been used
for anonymizing high-dimensional data [12]. Please refer to
Section 2.2 and Section 2.3 for a detailed comparison of
slicing with generalization and bucketization, respectively.

Slicing has some connections to marginal publication [TS]; 
both of them release correlations among a subset of at- 
tributes. Slicing is quite different from marginal publica- 
tion in a number of aspects. First, marginal publication 
can be viewed as a special case of slicing which does not 
have horizontal partitioning. Therefore, correlations among 
attributes in different columns are lost in marginal publica- 
tion. By horizontal partitioning, attribute correlations be- 
tween different columns (at the bucket level) are preserved. 
Marginal publication is similar to overlapping vertical par- 
titioning, which is left as future work (see Section 8). 
Second, the key idea of slicing is to preserve correlations be- 
tween highly-correlated attributes and to break correlations 
between uncorrelated attributes, thus achieving both bet- 
ter utility and better privacy. Third, existing data analysis 
(e.g., query answering) methods can be easily used on the 
sliced data. 

Existing privacy measures for membership disclosure 
protection include differential privacy [7] [5] [5] and δ- 
presence [27]. Differential privacy has recently received 
much attention in data privacy, especially for interactive 
databases [3J HI H [3J]. Rastogi et al. [2S] design the 
αβ algorithm for data perturbation that satisfies differential 
privacy. Machanavajjhala et al. [24] apply the notion of dif- 
ferential privacy to synthetic data generation. On the other 
hand, δ-presence [27] assumes that the published database 
is a sample of a large public database and the adversary 
has knowledge of this large database. The calculation of 
disclosure risk depends on this large database. 
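
For reference, the two membership-oriented notions can be stated compactly. The formulations below follow their commonly cited forms rather than anything specific to this paper; the symbols $\mathcal{A}$ (a randomized mechanism), $D$ and $D'$ (databases differing in one record), $S$ (a set of outputs), $P$ (the public database), $T$ (the private table), $T^{*}$ (its published version), and the bounds $\delta_{\min}$, $\delta_{\max}$ are notation assumed here.

```latex
% epsilon-differential privacy: for all neighboring D, D' and all output sets S
\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{A}(D') \in S]

% delta-presence with delta = (\delta_{\min}, \delta_{\max}): for every tuple t in P
\delta_{\min} \;\le\; \Pr[\, t \in T \mid T^{*} \,] \;\le\; \delta_{\max}
```

The second condition makes explicit why the disclosure risk depends on the large public database $P$: the probability is evaluated for every tuple of $P$.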

Finally, privacy measures for attribute disclosure pro- 
tection include ℓ-diversity [23], (α, k)-anonymity [33], t- 
closeness [20], (k, e)-anonymity [16], (c, k)-safety [25], 
privacy skyline [5], m-confidentiality [33], and (ε, m)- 
anonymity [TS]. We use ℓ-diversity in slicing for attribute 
disclosure protection. 
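
To make the flavor of an ℓ-diversity requirement concrete, the following sketch checks that no sensitive value accounts for more than a 1/ℓ fraction of any bucket. It is only a simplified bucket-level frequency check and may differ from the condition this paper uses for sliced tables; the function name and the toy buckets are hypothetical.

```python
from collections import Counter
from typing import List


def satisfies_l_diversity(buckets: List[List[str]], l: int) -> bool:
    """True if, in every bucket, each sensitive value has frequency <= 1/l."""
    for sensitive_values in buckets:
        if not sensitive_values:
            continue  # an empty bucket reveals nothing; skipped by assumption
        most_common = max(Counter(sensitive_values).values())
        # Integer form of most_common / len(bucket) > 1 / l.
        if most_common * l > len(sensitive_values):
            return False
    return True


buckets = [
    ["flu", "dyspepsia", "flu", "bronchitis"],
    ["flu", "gastritis"],
]
print(satisfies_l_diversity(buckets, 2))
print(satisfies_l_diversity(buckets, 3))
```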

8. DISCUSSIONS AND FUTURE WORK 

This paper presents a new approach called slicing to 
privacy-preserving microdata publishing. Slicing overcomes 
the limitations of generalization and bucketization and pre- 
serves better utility while protecting against privacy threats. 
We illustrate how to use slicing to prevent attribute disclo- 
sure and membership disclosure. Our experiments show that 
slicing preserves better data utility than generalization and 
is more effective than bucketization in workloads involving 
the sensitive attribute. 

The general methodology proposed by this work is that, 
before anonymizing the data, one can analyze the data char- 
acteristics and use these characteristics in data anonymiza- 
tion. The rationale is that one can design better data 
anonymization techniques when one knows the data better. 
In [21], we show that attribute correlations can be used for 
privacy attacks. 

This work motivates several directions for future research. 
First, in this paper, we consider slicing where each attribute 
is in exactly one column. An extension is the notion of over- 
lapping slicing, which duplicates an attribute in more than 
one column. This releases more attribute correlations. For 
example, in Table 1(f), one could choose to include the Dis- 
ease attribute also in the first column. That is, the two 
columns are {Age, Sex, Disease} and {Zipcode, Disease}. 
This could provide better data utility, but the privacy im- 
plications need to be carefully studied and understood. It is 
interesting to study the tradeoff between privacy and util- 
ity [22]. 
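
A minimal sketch of overlapping slicing follows, using the two columns {Age, Sex, Disease} and {Zipcode, Disease} from the example above. The bucket assignment (consecutive rows), the function name `slice_table`, and the toy rows are assumptions for illustration; the sketch says nothing about whether the result satisfies any privacy requirement.

```python
import random
from typing import Dict, List, Sequence, Tuple


def slice_table(rows: List[Dict[str, object]],
                columns: Sequence[Tuple[str, ...]],
                bucket_size: int, seed: int = 0) -> List[List[List[tuple]]]:
    """Group rows into buckets; permute each column's values independently."""
    rng = random.Random(seed)
    sliced = []
    for start in range(0, len(rows), bucket_size):
        bucket = rows[start:start + bucket_size]
        sliced_bucket = []
        for col in columns:
            values = [tuple(r[a] for a in col) for r in bucket]
            rng.shuffle(values)  # breaks the linkage between columns
            sliced_bucket.append(values)
        sliced.append(sliced_bucket)
    return sliced


rows = [
    {"Age": 22, "Sex": "M", "Zipcode": 47906, "Disease": "dyspepsia"},
    {"Age": 22, "Sex": "F", "Zipcode": 47906, "Disease": "flu"},
    {"Age": 33, "Sex": "F", "Zipcode": 47905, "Disease": "flu"},
    {"Age": 52, "Sex": "F", "Zipcode": 47905, "Disease": "bronchitis"},
]
# Disease appears in both columns, so the columns overlap.
overlapping = [("Age", "Sex", "Disease"), ("Zipcode", "Disease")]
sliced = slice_table(rows, overlapping, bucket_size=4)
for column_values in sliced[0]:
    print(column_values)
```

Because Disease is published alongside both Age/Sex and Zipcode, more correlations are released, which is exactly why the privacy implications need separate study.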



Second, we plan to study membership disclosure protec- 
tion in more detail. Our experiments show that random 
grouping is not very effective. We plan to design more effec- 
tive tuple grouping algorithms. 

Third, slicing is a promising technique for handling high- 
dimensional data. By partitioning attributes into columns, 
we protect privacy by breaking the association of uncor- 
related attributes and preserve data utility by preserving 
the association between highly-correlated attributes. For 
example, slicing can be used for anonymizing transaction 
databases, which has been studied recently in [32] [37] [26]. 

Finally, while a number of anonymization techniques have 
been designed, how to use the anonymized data remains an 
open problem. In our experiments, we randomly gen- 
erate the associations between column values of a bucket. 
This may lose data utility. Another direction is to design data 
mining tasks that use the anonymized data [13] computed by 
various anonymization techniques. 
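
The random association step mentioned above can be sketched as follows: within one bucket, each column's values are shuffled independently and then joined position by position into full records. The representation of a sliced bucket and the name `reconstruct_bucket` are assumptions made for this illustration, not the code used in the experiments.

```python
import random
from typing import Dict, List, Sequence, Tuple


def reconstruct_bucket(sliced_bucket: Sequence[Sequence[tuple]],
                       columns: Sequence[Tuple[str, ...]],
                       rng: random.Random) -> List[Dict[str, object]]:
    """Randomly associate the column values of one bucket into full records."""
    shuffled = []
    for values in sliced_bucket:
        values = list(values)  # copy so the published bucket is not modified
        rng.shuffle(values)
        shuffled.append(values)
    records = []
    for parts in zip(*shuffled):  # one value tuple per column
        record: Dict[str, object] = {}
        for col, part in zip(columns, parts):
            record.update(zip(col, part))
        records.append(record)
    return records


columns = [("Age", "Sex"), ("Zipcode", "Disease")]
sliced_bucket = [
    [(22, "M"), (22, "F"), (33, "F"), (52, "F")],
    [(47906, "dyspepsia"), (47906, "flu"), (47905, "flu"), (47905, "bronchitis")],
]
records = reconstruct_bucket(sliced_bucket, columns, random.Random(0))
for record in records:
    print(record)
```

The reconstructed records keep each column's values intact but may pair them differently from the original table, which is the source of the utility loss noted above.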